Contact:(+44) 07967741721, stephen.newhouse@kcl.ac.uk, stephen.j.newhouse@gmail.com
Important: This document is best viewed as an HTML file in any web browser.
Analysis of NGS data for clinical genetics involves the following basic steps:
It is essential to assess data quality and performance at each step.
At the end of this workshop you will be able to:
Work through this document at your own pace.
Please read everything carefully. Ask questions and have fun!
At the beginning of some of the sections I list the main software called by Galaxy and their basic input and output formats.
Galaxy is an ongoing project, and bioinformatics software is still being developed, fixed, tweaked and improved upon. As a result, some of the screenshots in this document may not exactly match what you see on the live usegalaxy.org website or the Galaxy instance we provide on the Genomics England compute environment. Software versions may also differ. Don’t panic. Keep calm. The workflow/pipeline and tools we use will stay the same - there may just be some subtle differences. Use the search bar in Galaxy to find the tool you need. Ask questions. Google it. Report any bugs and typos.
The sequence data that you will be analysing is from a 3-year-old male who presented with difficulty walking after the first year of life, followed by signs of muscle wasting and weakness, muscle rigidity, developmental delays, progressive loss of vision leading to blindness, convulsions, impaired swallowing, paralysis, and dementia. An MRI scan showed a “tigroid” or “leopard-skin” appearance within the deep white matter - recognizable abnormalities of white matter. Further clinical tests suggest a diagnosis of Metachromatic leukodystrophy (MLD, also called Arylsulfatase A deficiency), a lysosomal storage disease commonly listed in the family of leukodystrophies as well as among the sphingolipidoses, as it affects the metabolism of sphingolipids.
The patient’s exome was sequenced using the paired-end method on an Illumina HiSeq 2500 following target enrichment using the Nextera Rapid Capture Exome and Expanded Exome kit.
This data is simulated - but it represents a real-life scenario.
Save the files to a folder called workshop_data_2018 (or something memorable).

The Workshop Data
| FILE | DESCRIPTION | URL |
|---|---|---|
| WES01_chr22m_R1.fastq.gz | Illumina fastq file read 1 | click me |
| WES01_chr22m_R2.fastq.gz | Illumina fastq file read 2 | click me |
| chr22.genes.hg19.bed | annotation bed file | click me |
You should have a folder that looks something like:
You will be using Galaxy to construct and run your NGS pipeline. Galaxy is an open source, web-based platform for data intensive biomedical research. If you are new to Galaxy start here and/or consult the help resources:
Learning Resources
Getting more help on Galaxy
You should have all registered for a Galaxy account!
A lot of useful information is displayed in the centre panel. Don’t ignore it!
An example is shown below. This is for the software Freebayes: a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
You will get used to this as you work through the workshop
Navigate to the workshop_data_2018 folder on your local computer, select all 3 files and click open.

Paired End Sequence data: FASTQ Format
One lane of paired end sequencing was performed so you have two files of raw sequence data (WES01_chr22m_R1.fastq.gz and WES01_chr22m_R2.fastq.gz) which contain all the sequence data for read 1 (R1) and read 2 (R2) respectively (Figure: Paired End Sequence data). To save on computing time and disk space, the NGS data for WES01 has been filtered to contain reads mapping to chromosome 22 only and the files have been compressed (hence the .gz extension).
Figure: Paired End Sequence data
The raw NGS data is held in fastq format: http://en.wikipedia.org/wiki/FASTQ_format. Fastq files are the simplest and most generic way of storing read sequences and qualities.
Figure: Fastq Data
@HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1
GCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAATATTTTCTTATTAACTTCC
+
CCCFFFFFHHHHHJJJHIJJJJJJJJJJJJJJJJHFGIIIGJJJJIIHHHHHHFFFFFFFEEEEEEEEEDDDDDDDDDDEDDDDFFEEEDDDDEEEDDEDD
Illumina sequence identifiers
Sequences from the Illumina software use a systematic identifier.
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
| Value | Description |
|---|---|
| EAS139 | the unique instrument name |
| 136 | the run id |
| FC706VJ | the flowcell id |
| 2 | flowcell lane |
| 2104 | tile number within the flowcell lane |
| 15343 | ‘x’-coordinate of the cluster within the tile |
| 197393 | ‘y’-coordinate of the cluster within the tile |
| 1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
| Y | Y if the read is filtered, N otherwise |
| 18 | 0 when none of the control bits are on, otherwise it is an even number |
| ATCACG | index sequence |
Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
Phred Quality Scores
A Phred quality score is a measure of the quality of the identification of the sequence of bases generated by automated DNA sequencing.
A phred score or quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
It is more efficient to store the sequence quality scores as characters because they require less disk space than numbers (i.e. 1 character versus multiple digits). A one-character-per-base encoding also keeps the quality line exactly the same length as the sequence line, with no delimiters needed. Sequence quality scores on the phred scale are recovered by mapping each ASCII character back to its associated number.
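The ASCII-to-Phred mapping described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Phred+33 ("fastqsanger") encoding; the short quality string is taken from the start of the example read shown earlier.

```python
# Convert a FASTQ quality string (Phred+33 / "fastqsanger" encoding)
# into per-base Phred scores and error probabilities.

def phred_scores(qual_string, offset=33):
    """Map each ASCII quality character to its integer Phred score."""
    return [ord(ch) - offset for ch in qual_string]

def error_probability(q):
    """Phred score Q -> probability the base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

qual = "CCCFFFFFHHHHHJJJ"      # start of the quality line from the example read
scores = phred_scores(qual)
print(scores[:4])               # 'C' is ASCII 67, so 67 - 33 = 34
print(error_probability(30))    # Q30 -> 1-in-1000 chance the call is wrong
```

So a base with quality character ‘C’ has Q = 34, i.e. roughly a 1-in-2500 chance of being a miscall.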
BED File Annotation data: chr22.genes.hg19.bed
The uploaded file chr22.genes.hg19.bed describes the exome sequences and genes that have been targeted in your trial data - in .bed format. A BED file (.bed) is a tab-delimited text file that defines genomic features. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat.
Briefly, the BED format consists of one line per feature, each containing 3-12 columns of data, plus optional feature definition lines. A minimal .bed file uses 3-4 tab delimited columns of data to describe the target:
The first three fields in each feature line are required:
chr22.genes.hg19.bed:

chr22 16256331 16256677 POTEH-chr22-16256332-16256677
chr22 16258184 16258303 POTEH-chr22-16258185-16258303
chr22 16266928 16267095 POTEH-chr22-16266929-16267095
chr22 16269872 16269943 POTEH-chr22-16269873-16269943
chr22 16275206 16275277 POTEH-chr22-16275207-16275277
chr22 16277747 16277885 POTEH-chr22-16277748-16277885
chr22 16279194 16279301 POTEH-chr22-16279195-16279301
chr22 16287253 16287937 POTEH-chr22-16287254-16287937
chr22 16448825 16449804 OR11H1-chr22-16448826-16449804
chr22 17071647 17073700 CCT8L2-chr22-17071648-17073700
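A minimal BED parser makes the column layout concrete. The line below is the first feature from the chr22.genes.hg19.bed excerpt above; note that BED intervals are 0-based and half-open, which is why the start coordinate in column 2 is one less than the start embedded in the feature name.

```python
# Parse the minimal 3-4 column BED format: chrom, start (0-based), end, optional name.
bed_line = "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677"

def parse_bed(line):
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name

chrom, start, end, name = parse_bed(bed_line)
print(end - start)   # feature length: half-open intervals mean length = end - start
```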
Set the datatype for each uploaded file and the genome build:
- chr22.genes.hg19.bed : Type = bed
- WES01_chr22m_R1.fastq.gz : Type = fastqsanger
- WES01_chr22m_R2.fastq.gz : Type = fastqsanger
- Genome: type hg19 and select Human Feb. 2009 (GRCh37/hg19)
Software
Tools
Input/Output
FASTQ_Format: A FASTQ file normally uses four lines per sequence.
Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
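The four-line record structure can be checked programmatically. This is a simple sketch, not a production parser; the example record reuses the header from the workshop data with a truncated sequence for brevity.

```python
# Read FASTQ records four lines at a time, checking the '@' and '+' markers
# and that the quality line matches the sequence line in length.
def read_fastq(lines):
    it = iter(lines)
    for header in it:
        seq, plus, qual = next(it), next(it), next(it)
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual)   # line 4 must have one symbol per base
        yield header[1:], seq, qual

record = [
    "@HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1",
    "GCACGTGG",    # sequence truncated here for illustration
    "+",
    "CCCFFFFF",
]
for name, seq, qual in read_fastq(record):
    print(name, len(seq))
```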
Illumina sequence identifiers
Sequences from the Illumina software use a systematic identifier:
@HWUSI-EAS100R:6:73:941:1973#0/1
| ID | Description |
|---|---|
| HWUSI-EAS100R | the unique instrument name |
| 6 | flowcell lane |
| 73 | tile number within the flowcell lane |
| 941 | ‘x’-coordinate of the cluster within the tile |
| 1973 | ‘y’-coordinate of the cluster within the tile |
| #0 | index number for a multiplexed sample (0 for no indexing) |
| /1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
With Casava 1.8 the format of the ‘@’ line has changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
| ID | Description |
|---|---|
| EAS139 | the unique instrument name |
| 136 | the run id |
| FC706VJ | the flowcell id |
| 2 | flowcell lane |
| 2104 | tile number within the flowcell lane |
| 15343 | ‘x’-coordinate of the cluster within the tile |
| 197393 | ‘y’-coordinate of the cluster within the tile |
| 1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
| Y | Y if the read is filtered, N otherwise |
| 18 | 0 when none of the control bits are on, otherwise it is an even number |
| ATCACG | index sequence |
Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
Fastq Header Explanation

Header: @HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1

1. HWI-D00119 - the unique instrument name
2. 50 - the run id
3. H7AP8ADXX - the flowcell id
4. 1 - the flowcell lane
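The colon-separated header fields can be pulled apart directly, which is a handy sanity check when you meet an unfamiliar FASTQ file. This sketch handles the pre-Casava-1.8 style used in the workshop data, where the pair member is appended as /1 or /2.

```python
# Split a pre-Casava-1.8 Illumina read id into its colon-separated fields.
header = "@HWI-D00119:50:H7AP8ADXX:1:1212:10369:36657/1"

name, _, pair = header.lstrip("@").partition("/")
instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
print(instrument, run_id, flowcell, lane, pair)
```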
To assess the quality of the raw sequence data and to guide quality control we will use a program called FastQC. The program outputs summary graphs and tables that show if there are any problem areas, which could influence assembly or variant calling if not addressed.
You can learn more about the program here (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
The FastQC reports will be shown in the History Pane: queued jobs are grey with a clock symbol, running jobs yellow with a buffering symbol, finished jobs green, and failed jobs red with a cross.
Use the FastQC reports on data 1 and 2 to answer:
Looking at the per base sequence quality you will notice that the average base quality drops towards the end of reads (Figure above). This drop in quality is typical for ensemble-based sequencing by synthesis (SBS) methods, such as Illumina, which add complementary bases one at a time in a cluster of identical sequences to determine a consensus sequence from the ‘average’ sequence signal over all copies in the cluster. As nucleotides are added, some of the sequences in a cluster grow at a different rate and become desynchronized, which reduces the accuracy of the ‘average’ sequence signal.
Another particularly important plot is that of ‘Overrepresented sequences’, which lists sequences that account for more than 0.1% of the total. The presence of an overrepresented sequence suggests that the sequence is biologically significant, or that the library is contaminated or has low diversity. To check for contamination, each overrepresented sequence is compared to a database of common contaminants such as sequencing adaptors which can then be removed from the raw FastQ data.
In a typical analysis you may want to raise technical issues identified by FastQC such as low read count, poor quality, and overrepresented sequences with the data provider. To ensure that only data of a certain quality is used for further analysis we will exclude low quality reads. In paired-end data there are two fastq files per lane of sequencing which are synchronised so that matching pairs are stored in the same line of each file (e.g. the read in line 1, file 1 is paired with the read in line 1, file 2 and so on).
To maintain this order when filtering, we will filter the fastq files using Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170), which processes the two fastq files together so that reads are removed as a pair.
Nextera adapters are:
- Read1 CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
- Read2 CTGTCTCTTATACACATCTGACGCTGCCGACGA
Additional Reference: what-sequences-do-i-use-for-adapter-trimming
You should now have set up 2 Trimmomatic operations.
For paired-end data a particular strength of Trimmomatic is that it retains the pairing of reads (from R1 and R2) in the filtered output files:
Two FASTQ files (R1-paired and R2-paired) contain one read from each pair where both have survived filtering. Additionally two FASTQ files (R1-unpaired and R2-unpaired) contain reads where one of the pair failed the filtering steps.
When the job has successfully run you should see something like:
Software
Tools
Input/Output
We will use a program called BWA-MEM (Burrows-Wheeler Aligner) to align the sequence data to the human genome. You can learn more about the bwa program here (http://bio-bwa.sourceforge.net/).
This next step is very important!
Set the read group information (the field-to-tag mapping below follows the standard SAM read group tags):
- ID (read group identifier): HWI-D0011.50.H7AP8ADXX.1.WES01
- SM (sample): WES01
- PL (platform): ILLUMINA
- LB (library): nextera-wes01-blood
- PU (platform unit): HWI-D00119
- DT (date): 2017-02-23

IMPORTANT: READ https://galaxyproject.org/learn/galaxy-ngs101/#read-groups for more information on read groups.
The alignment process maps the read data to the reference human genome, creating a Binary Alignment/Map (BAM) file. The BAM file is binary and not directly viewable; clicking the view button will download the file.
IMPORTANT:READ: https://galaxyproject.org/learn/galaxy-ngs101/#understanding-and-manipulating-sambam-datasets for more information.
Software
Tools
Input/Output
picard tools MarkDuplicates - input: bwa aligned BAM file; output: duplicate-read-marked BAM file
samtools view - input: duplicate-read-marked BAM file; output: filtered duplicate-read-marked BAM file and related bai index
samtools Flagstat - input: filtered duplicate-read-marked BAM file; output: text file
samtools IdxStats - input: filtered duplicate-read-marked BAM file; output: text file
picard tools CollectInsertSizeMetrics - input: filtered duplicate-read-marked BAM file; output: text and pdf (a plot) file
bedtools coverageBed - input: filtered duplicate-read-marked BAM file and .bed file; output: text file
For variant calling, reads with low mapping quality (phred <20), unmapped reads, secondary alignments, reads failing platform/vendor quality checks, and duplicate reads that have the same start and stop position are typically removed or ignored because they can influence genotyping accuracy. For example, at heterozygous sites the two alleles should be evenly distributed (50% of reads have the A allele and 50% have the B allele) but if reads with the A allele are duplicated the A allele will become overrepresented and the site might be misinterpreted as homozygous for the A allele.
We will use 1) NGS: Picard > MarkDuplicates and then 2) SAMtools > Filter SAM or BAM, output SAM or BAM files to make a new BAM file which flags duplicate molecules and excludes poor quality reads.
INFO: How PCR duplicates arise in next-generation sequencing. Duplicates originate mostly from DNA prep methods and cause biases that skew variant calling results.
This tool examines aligned records in the supplied SAM or BAM dataset to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged. Two files are produced, 1) the new BAM file with duplicate reads marked and 2) a metrics file summarising the number of duplicate reads found.
This tool uses the samtools view command in SAMtools toolkit to filter a SAM or BAM file on the MAPQ (mapping quality), FLAG bits, Read Group, Library, or region.
Input is either a SAM or BAM file.
The output file will be SAM or BAM (depending on the chosen option), filtered by the selected options.
When aligning reads to the reference genome, anywhere between 0 and 20% of reads are not aligned due to sequencing errors, sample contamination (e.g. bacterial or viral DNA), gaps in the reference genome and genome variation. Use SAMtools Flagstat to determine how many reads have been aligned. We will compare the duplicate-marked BAM file with the final filtered BAM file.
Click the view data icon for both and use the Flagstat output to determine the percentage of mapped reads, and calculate the number of reads that were filtered out (difference in total read count between the raw and filtered BAM files).
Before Filtering
After Filtering
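The "reads filtered out" calculation is just the difference between the first lines of the two Flagstat reports. A small sketch, with invented read counts (your own numbers will come from the two Flagstat outputs in your history):

```python
# Parse the first line of two samtools flagstat reports and take the difference.
# The counts below are invented examples; the line format is flagstat's usual
# "<QC-passed> + <QC-failed> in total (QC-passed reads + QC-failed reads)".
before = "2489370 + 0 in total (QC-passed reads + QC-failed reads)"
after  = "2374583 + 0 in total (QC-passed reads + QC-failed reads)"

def total_reads(flagstat_line):
    parts = flagstat_line.split()
    return int(parts[0]) + int(parts[2])   # QC-passed + QC-failed

removed = total_reads(before) - total_reads(after)
print(removed)   # reads excluded by the MAPQ/flag filtering step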
Converting BAM to SAM: As mentioned above, the BAM file is held in binary format and so it cannot be viewed directly. However, the BAM file can be converted to a viewable text format known as a Sequence Alignment/Map (SAM) file. The SAM file format is described in detail here (http://samtools.github.io/hts-specs/SAMv1.pdf).
Use BAM to SAM to convert the filtered BAM file to a SAM file.
Spend some time looking at this file and its format.
Sam format: basic column information: In the SAM file, lines starting with “@” are headers, and those without “@” are read sequences aligned to the reference.
| SAM_Column | Description |
|---|---|
| QNAME | Query name of the read or the read pair |
| FLAG | Bitwise flag (pairing, strand, mate strand, duplicate status etc.) |
| RNAME | Reference sequence name |
| POS | 1-Based leftmost position of clipped alignment |
| MAPQ | Mapping quality (Phred-scaled) |
| CIGAR | Extended CIGAR string (operations MIDNSHP) |
| MRNM | Mate reference name (‘=’ if same as RNAME) |
| MPOS | 1-based leftmost mate position |
| ISIZE | Inferred insert size |
| SEQ | Query sequence on the same strand as the reference |
| QUAL | Query quality (ASCII-33=Phred base quality) |
I have extracted two reads for you to look at. Note that they both have the same read id: HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573
HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573 99 chr22 16123342 60 101M = 16123407 166 CTACCTCCAGAGCATCAGGCCTTGCCTGGCCCTGGGCAGAGAGCTGGTCTACAACCGGCGGCCTGGGGACGAGGGCACTGTCATGTCTCTAGCTGGGAAAT CCCFFFFFHHHHHIJIJJJIJJJIJIIJIIJJIJIIJJIJJJJIIIJHJJJJJIGGHIGFDDDDDDDDDDDD@BBDDDDDDDDDDCDEDDEDDDDDDDCDD XA:Z:chr14,-19719869,101M,2;chr14,+19855438,101M,2; MD:Z:101 PG:Z:MarkDuplicates RG:Z:HWI-D0011.50.H7AP8ADXX.1.WES01 NM:i:0 AS:i:101 XS:i:91
HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573 147 chr22 16123407 60 101M = 16123342 -166 GGGACGAGGGCACTGTCATGTCTCTAGCTGGGAAATACACATTGAACAACTGGTTGGCAACGTTAACGTTGAGCCAGGCGGGCATGCACGCAACATACTAC B@DDDDDDDDDDDDDDDDDDDDDDEEDEEFFEFFFEEHHHHFHGJJJJJJIIIIJJIIJJIJJJIGGJHEHEGHHFIIJJIJJIJIJJHHHHHFFFFF@B@ XA:Z:chr14,+19719804,101M,2;chr14,-19855503,101M,3;chr2,+132481028,101M,3; MD:Z:101 PG:Z:MarkDuplicates RG:Z:HWI-D0011.50.H7AP8ADXX.1.WES01 NM:i:0 AS:i:101 XS:i:91
SAM/BAM Format mapping:
QNAME: HWI-D00119:50:H7AP8ADXX:1:2202:7515:35573
FLAG: 147
RNAME: chr22
POS: 16123407
MAPQ: 60
CIGAR: 101M
MRNM: =
MPOS: 16123342
ISIZE: -166
SEQ: GGGACGAGGGCACTGTCATGTCTCTAGCTG…
QUAL: B@DDDDDDDDDDDDDDDDDDDDDDEEDEEF…
SAM/BAM Flags Explained
Use this website (http://broadinstitute.github.io/picard/explain-flags.html) to decode the flags (99 and 147) of the reads and determine which strand of the reference genome they are aligned with.
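The same decoding the website performs can be written as a small lookup over the SAM FLAG bits defined in the SAM specification, which makes it clear why 99 and 147 describe the two mates of a properly paired read on opposite strands.

```python
# Decode the SAM FLAG bitfield (same bit definitions as the SAM specification
# and the Broad "explain flags" page).
SAM_FLAGS = [
    (0x1,   "read paired"),
    (0x2,   "read mapped in proper pair"),
    (0x4,   "read unmapped"),
    (0x8,   "mate unmapped"),
    (0x10,  "read reverse strand"),
    (0x20,  "mate reverse strand"),
    (0x40,  "first in pair"),
    (0x80,  "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def explain_flag(flag):
    return [label for bit, label in SAM_FLAGS if flag & bit]

print(explain_flag(99))    # 99 = 1+2+32+64: forward-strand read, first in pair
print(explain_flag(147))   # 147 = 1+2+16+128: reverse-strand read, second in pair
```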
When viewing the SAM file you may have noticed that not all the reads map to chromosome 22, which is surprising as the sequence was initially filtered to contain reads mapping to chromosome 22 only. Use IdxStats to generate a breakdown of the number of reads that map to each chromosome or contig.
NGS: SAMtools > IdxStats
This will produce a file with 4 columns: 1) reference sequence identifier; 2) reference sequence length; 3) number of mapped reads; 4) number of placed but unmapped reads (typically unmapped partners of mapped reads).
Use the IdxStats output to calculate the percentage of reads mapping to chromosome 22.
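The percentage is the chr22 mapped-read count (column 3) divided by the total mapped reads over all rows. A sketch with invented counts - substitute the numbers from your own IdxStats output:

```python
# Compute the percentage of mapped reads on chr22 from samtools idxstats output
# (columns: ref name, ref length, mapped reads, placed-but-unmapped reads).
# These rows are made-up examples for illustration only.
idxstats = [
    ("chr22", 51304566,  195000, 120),
    ("chr14", 107349540, 4200,   30),
    ("*",     0,         0,      9000),   # '*' row collects unmapped reads
]

mapped_total = sum(row[2] for row in idxstats)
chr22_mapped = next(row[2] for row in idxstats if row[0] == "chr22")
pct = 100 * chr22_mapped / mapped_total
print(f"{pct:.1f}% of mapped reads are on chr22")
```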
Insert sizes (the region between the 5’ ends of the paired reads; see Figure below) are important for correct alignment and variant calling. During alignment with BWA-MEM, we estimated that the mean insert size was 250bp, so the program expected reads to be separated by this distance plus or minus the standard deviation. For variant calling, reads are assumed to be independent. However, if an insert is smaller than the sum of the read pairs the reads will overlap and not be independent in the overlapping region. If a PCR error occurs in this overlapping region it will be present in both reads, which may result in there being enough evidence to call a variant at this site that is not real.
Overlapping reads duplicating a PCR error shown in yellow
Use Picard CollectInsertSizeMetrics to produce some statistics and a histogram of the insert size.
NGS: Picard > CollectInsertSizeMetrics
This tool will produce two output files. The first is a tabular output with some statistics and a list of insert sizes and counts. The second output is a .pdf with an insert size histogram, which should look like the figure on the next page:
View the tabular output and record the mean insert size and standard deviation
Insert size histogram
Identification and accuracy of variant calling relies on the depth and breadth of sequence coverage. To calculate coverage, a file in bed format describing the target region is required. Bed files use tab delimited columns of data to describe the target: chromosome, left location (bp), right location (bp).
We will use the uploaded bed file “chr22.genes.hg19.bed”, which describes the exome sequence that has been targeted in your trial data.
The options we selected will report the per-base coverage for each feature in chr22.genes.hg19.bed.
The output will consist of a line for each one-based position in each feature, followed by the coverage detected at that position.
The output explained: the 6th (last) column is the depth at position X (column 5) for the genomic feature (exon) A (columns 1-4).
chr22 16256331 16256677 POTEH-chr22-16256332-16256677 1 16
chr22 16256331 16256677 POTEH-chr22-16256332-16256677 10 19
chr22 16256331 16256677 POTEH-chr22-16256332-16256677 100 90
For line 1: the depth at the first base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 16.
For line 2: the depth at the 10th base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 19.
For line 3: the depth at the 100th base (indicated by col 5) of the feature/exon POTEH-chr22-16256332-16256677 is 90.
Note the order in column 5: it is not numerical. This is a quirk of how BED files are sorted (positions are sorted as text rather than as numbers).
View the new sorted data. It should look like:
Mean Coverage: We can calculate the mean coverage of the targeted regions by averaging over the values in column 6:
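Averaging over column 6 can be sketched directly; the rows below are the three per-base coverage lines shown in the example above, so the mean here is only over those three positions (in practice you average over every line in the file).

```python
# Mean target coverage = average of the per-base depths in column 6 of the
# bedtools coverageBed -d output. Lines taken from the example above.
rows = [
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t1\t16",
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t10\t19",
    "chr22\t16256331\t16256677\tPOTEH-chr22-16256332-16256677\t100\t90",
]

depths = [int(line.split("\t")[5]) for line in rows]
mean_cov = sum(depths) / len(depths)
print(mean_cov)   # (16 + 19 + 90) / 3
```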
View the data. It should look like:
What is the average depth of coverage?
Software
Tools
Input/Output
It is useful to look at the aligned data as this is one way to identify variants, assess their quality, and determine their genomic context with respect to other annotated features of the human genome such as genes, repeat sequences, transcription factor binding sites etc. We therefore recommend alignment visualisation before validation of in-silico variants by independent sequencing methods. The Integrative Genome Viewer (IGV) is a popular tool for interrogating aligned NGS data (look here for more details on IGV: www.ncbi.nlm.nih.gov/pubmed/22517427).
You can run IGV from Galaxy or download and install your own local version. If this does not work on the University network computers, don’t worry - try this at home on your own computer. Your Galaxy session will remain as long as you don’t delete it.
Visualizing multiple datasets in Integrated Genome Viewer (IGV) - https://wiki.galaxyproject.org/Learn/GalaxyNGS101#Visualizing_multiple_datasets_in_Integrated_Genome_Viewer_.28IGV.29
Galaxy has a number of display applications allowing visualization of various datasets. IGV (integrative genome viewer) is one of the most versatile applications for looking at positional genomic data. In Galaxy you can view Interval, BED, BAM, and VCF datasets in IGV. The screencast Video NGS101-12 = Displaying multiple tracks in IGV : https://vimeo.com/123414437 shows how to do this.
Alternatively
View the Gene SMARCB1
SMARCB1 is a tumour suppressor gene that regulates cell cycle, growth and differentiation. An inactivating germline mutation in exon 1 of SMARCB1 has been reported in patients with schwannomatosis (http://omim.org/entry/162091). Schwannomas are mostly benign tumours involving Schwann cells, which myelinate the axons of nerve cells, but can cause problems if the tumour compresses a nerve. There are several cases where people with schwannomatosis have developed hearing loss due to an acoustic neuroma, which is a schwannoma on the vestibular nerve in the brain that is involved in hearing.
Things to think about: What does the coverage look like? Are all exons equally captured? Zoom in on an exon and get familiar with IGV and viewing the BAM file.
A Variant in SMARCB1
Software
Tools
Input/Output
vcffilter - input: vcf file; output: filtered vcf file

Accurate and consistent variant calling requires statistical modelling and is essential for the clinical implementation of NGS. However, many programs are available for calling variants and their concordance varies. Furthermore, variants have different levels of confidence due to differences in data quality. For variants with intermediate confidence levels, it is difficult to separate true variation from artefacts that arise from many factors such as sequencing error, misalignment and inaccurate base quality scores. As a result, the evidence for variant calls requires scrutiny, and caution should be used when interpreting positive and negative findings, especially for indels, which are more error prone. At the end of this exercise you will be able to:
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. See https://github.com/ekg/freebayes for details on FreeBayes.
FreeBayes uses short-read alignments (BAM files with Phred+33 encoded quality scores, now standard) for any number of individuals from a population and a reference genome (in FASTA format) to determine the most-likely combination of genotypes for the population at each position in the reference. It reports positions which it finds putatively polymorphic in variant call file (VCF) format. It can also use an input set of variants (VCF) as a source of prior information, and a copy number variant map (BED) to define non-uniform ploidy variation across the samples under analysis.
FreeBayes emits a standard VCF 4.1 output stream. This format is designed for the probabilistic description of allelic variants within a population of samples, but it is equally suited to describing the probability of variation in a single sample.
Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the locus described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting everything output by freebayes as ground truth.
By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered.
-f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1"

See: freebayes in depth: model, filtering, and walkthrough
The Freebayes Information Fields we used for filtering
What calls do we “know” are poor (w.r.t. freebayes VCF annotations)?
- low-quality calls (QUAL < N)
+ also, maybe QUAL is high but QUAL / AO is low
- loci with low read depth (DP < N)
- alleles that are only seen on one strand
+ remove by “SAF > 0 & SAR > 0”
- alleles that are only observed by reads placed to the left or right
+ remove by “RPL > 0 & RPR > 0”
Suggested freebayes hard filters for diploid humans:
QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1
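The hard-filter expression above is just a boolean test on the per-record INFO values. A minimal sketch of the same logic in Python, using two invented records (the values do not come from the workshop VCF):

```python
# Apply the suggested freebayes hard filter to a dict of VCF INFO values.
# QUAL: call quality; AO: alternate-allele observations; SAF/SAR: alt
# observations on the forward/reverse strand; RPL/RPR: alt observations
# placed to the left/right of the variant.
def passes(rec):
    return (rec["QUAL"] > 1 and rec["QUAL"] / rec["AO"] > 10
            and rec["SAF"] > 0 and rec["SAR"] > 0
            and rec["RPR"] > 1 and rec["RPL"] > 1)

# Invented example records for illustration:
good = {"QUAL": 900.0, "AO": 40, "SAF": 21, "SAR": 19, "RPR": 18, "RPL": 22}
strand_biased = {"QUAL": 50.0, "AO": 4, "SAF": 4, "SAR": 0, "RPR": 2, "RPL": 2}
print(passes(good), passes(strand_biased))   # the strand-biased call fails SAR > 0
```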
Look at the Filtered VCF File
How many variants are there after filtering?
| before | after |
|---|---|
Descriptive statistics and quality control on variants
Software
Tools
Input/Output
At this point in the analysis of whole exome data from patient WES01, we have assessed the quality of raw sequence data, removed low quality reads, aligned sequence data to the reference genome, removed poorly mapped reads and duplicate reads, assessed the alignment process, called variants and evaluated variant calling. The aim of this practical is to complete the analysis of WES01 by annotating the variants and using a filtering strategy to generate a short-list of potentially pathogenic variants. At the end of this exercise you will be able to:
The result of variant calling is a variant call file (VCF) which describes the location and data quality of each variant. However, the initial VCF file does not provide any information about the functional relevance of variants and their potential contribution to disease. To gain these insights we will use ANNOVAR (Wang et al 2010) to annotate each variant in the VCF file with respect to its location in relation to genes and coding sequences (exon, intron or intergenic), whether it changes the coding sequence and if so how (missense, stop gain, synonymous, frameshift, amino acid consequence etc). In addition, we will cross-reference the variants with databases of known variation (1000 Genomes, dbSNP, Exome Sequencing Project and COSMIC) and predictions of functional significance (avsift and conservation scores).

At this stage, it is important to be aware that the annotation result will vary according to the choice of annotation software (alternative software: SnpEff, Cingolani et al 2012, and Variant Effect Predictor VEP, McLaren et al 2010) and the definition of transcripts (Ensembl, Flicek et al 2012; RefSeq, Pruitt et al 2012; Gencode, Harrow et al 2012). This variability occurs because a single variant may have different effects in different transcripts (isoforms of the gene) and in some cases may affect different genes (genes can overlap, one on the forward strand and the other on the reverse strand), each with multiple transcripts. It is also possible to interpret a single variant as having multiple effects on a single transcript (Figure 1). The annotation software must therefore apply a set of rules to determine which annotation takes precedence (see here for the ANNOVAR precedence rules), but these rules vary between programs so that they generate different results. For example, the overall agreement for Loss of Function (LoF) variants annotated by ANNOVAR and VEP is just 64% (McCarthy et al 2014: Choice of transcripts and software has a large effect on variant annotation).
Annotation examples from McCarthy et al 2014
The variant is an insertion of an A in a stop codon (TGA) in the penultimate base of the exon in all transcripts except one. It is possible to interpret this variant as a frameshift insertion or a stop-loss but it is actually a synonymous variant because the stop codon remains in the same place.
The choice of which transcript set to use for annotation has an even bigger effect on the annotation results. For example, the overlap between LoF variants is only 44% when the same software is used to annotate against transcripts from Ensembl or RefSeq (McCarthy et al 2014). This is because RefSeq transcripts are based on a collection of non-redundant mRNA models that have strong support and are manually curated. As a result, the RefSeq database is highly accurate but does not contain all possible transcripts or gene models (n=41,501 transcripts from RefSeq release 57 that were used by ANNOVAR). In comparison, Ensembl provides a less curated but more comprehensive list of transcripts (n=115,901 transcripts from Ensembl version 69 that were used by ANNOVAR) based on information from the Consensus Coding Sequence (CCDS, Pruitt et al 2009), Human and Vertebrate Analysis and Annotation (HAVANA), Vertebrate Genome Annotation (Vega, Ashurst et al 2005), ENCODE (The ENCODE Project Consortium 2012) and GENCODE (Searle et al 2010).
ANNOVAR is a rapid, efficient tool for annotating the functional consequences of genetic variation from high-throughput sequencing data. wANNOVAR provides easy and intuitive web-based access to the most popular functionalities of the ANNOVAR software.
This tool will annotate variants using specified gene annotations, regions, and filtering databases. Input is a VCF dataset, and output is a table of annotations for each variant in the VCF dataset or a VCF dataset with the annotations in INFO fields.
ANNOVAR Website: http://annovar.openbioinformatics.org
Paper: http://nar.oxfordjournals.org/content/38/16/e164
Now view the annotated VCF file by clicking ‘View data’ and familiarise yourself with the contents.
There are many ways to filter the data. One of the most common strategies is to select coding variants (c6 == 'exonic') that are rare or absent from databases of known variation such as dbSNP (c20 == '.').
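If you want to reproduce this filter outside Galaxy, a minimal Python sketch is shown below. The column positions (1-based c6 for the region, c20 for the dbSNP identifier) are assumed to match this workshop's ANNOVAR table; adjust them if your annotation output differs, and note the rows here are hypothetical placeholders.

```python
# Minimal sketch of the Galaxy filter c6 == 'exonic' and c20 == '.'
# applied to a tab-separated ANNOVAR table.  Assumed column layout:
# 1-based column 6 = region, column 20 = dbSNP ID ('.' = not in dbSNP).

def keep(row):
    """Keep coding variants that are absent from dbSNP."""
    return row[5] == "exonic" and row[19] == "."  # 0-based indices for c6, c20

def make_row(region, dbsnp_id):
    """Build a toy row; all other fields are '.' placeholders."""
    row = ["."] * 25
    row[5], row[19] = region, dbsnp_id
    return row

rows = [
    make_row("exonic", "."),          # novel coding variant -> kept
    make_row("exonic", "rs0000001"),  # hypothetical known SNP -> dropped
    make_row("intronic", "."),        # non-coding -> dropped
]
kept = [r for r in rows if keep(r)]
print(len(kept))  # 1
```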
c20 == '.' and c6 == 'exonic'

The Case Study
The sequence data that you will be analysing is from a 3-year-old male who presented with difficulty walking after the first year of life, followed by signs of muscle wasting and weakness, muscle rigidity, developmental delays, progressive loss of vision leading to blindness, convulsions, impaired swallowing, paralysis, and dementia. An MRI scan showed a “tigroid” or “leopard-skin” appearance within the deep white matter - recognizable abnormalities of white matter. Further clinical tests suggest a diagnosis of Metachromatic leukodystrophy (MLD, also called Arylsulfatase A deficiency), a lysosomal storage disease commonly listed among the leukodystrophies as well as the sphingolipidoses, as it affects the metabolism of sphingolipids.
The Filtered Variants
Based on the patient's phenotype and clinical diagnosis, which of the variants above do you think is the causative mutation? If you didn't have detailed phenotype information and/or the diagnosis, would you have come to the same conclusion?
Download the filtered dataset and BAM index:

| Filtered BAM | Filtered VCF |
|---|---|
For the causal variant, use the VCF file from FreeBayes and look at the quality score and the other fields used for filtering.
The online version of ANNOVAR offers some additional annotations and improved variant filtering and prioritisation using phenotype information.
A full tutorial is available here: wANNOVAR
Download your VCF file and use wANNOVAR to perform annotation.
NOTE: If you’re running out of time to finish the practical you can read the section below in your own time and skip to the last section on how to create a workflow.
SIFT SIFT scores for non-synonymous variants are based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences.
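SIFT scores range from 0 to 1, and by convention scores below 0.05 are predicted damaging. A minimal sketch of that interpretation (the helper function is hypothetical):

```python
def sift_call(score, cutoff=0.05):
    """Conventional SIFT interpretation: scores below ~0.05 are
    predicted deleterious, higher scores tolerated."""
    return "deleterious" if score < cutoff else "tolerated"

print(sift_call(0.01))  # deleterious
print(sift_call(0.30))  # tolerated
```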
Polyphen-2 Polyphen-2 predicts the impact of an amino acid substitution on the structure and function of a protein. Predictions are based on features that characterise the substitution, including the sequence (does the substitution occur within an annotated site in the protein, e.g. an active or binding site, or a non-globular region such as a trans-membrane domain), phylogenetics (how often does the substitution occur in a family of related proteins) and structural information (could the substitution affect the protein's 3D structure, e.g. the hydrophobic core, electrostatic interactions, interactions with ligands or other important features). These features are assessed by a probabilistic classifier. Two versions of Polyphen-2 are available, HDIV and HVAR, which used different datasets to train and test the prediction models. HDIV should be used for evaluating rare alleles involved in complex phenotypes, where both disease-causing and mildly deleterious alleles are treated as damaging. The HDIV dataset consists of all damaging alleles in the UniProtKB database that affect molecular function and cause human Mendelian disease, versus differences between human proteins and their closely related mammalian homologues, which are assumed to be non-damaging.
HVAR should be used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all remaining human variation, including a wealth of mildly deleterious alleles. The HVAR dataset used to train and test Polyphen-2 consisted of all human disease-causing mutations from UniProtKB together with common human non-synonymous SNPs (MAF > 1%) without annotated involvement in disease, which were treated as non-damaging.
LRT score A likelihood ratio test (LRT) for the prediction of significantly conserved amino acid positions within the human proteome. Mutations that disrupt highly conserved amino acid residues are classified as 'deleterious'; others are classified as 'neutral' or 'unknown'.
Mutation Taster MutationTaster uses a Bayes classifier to predict the disease potential of an alteration. For this prediction, the frequencies of all single features for known disease mutations/polymorphisms were studied in a large training set composed of >390,000 known disease mutations from HGMD Professional and >6,800,000 harmless SNPs and Indel polymorphisms from the 1000 Genomes Project.
Mutation Assessor Mutation assessor scores estimate functional impact based on evolutionary conservation of the affected amino acids in protein homologs. The method has been validated on a large set (60k) of disease associated (OMIM) and polymorphic variants.
FATHMM score Functional Analysis through Hidden Markov Models. Predicts the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with “pathogenicity weights”, representing the overall tolerance of the protein/domain to mutations.
RadialSVM Support vector machine (SVM) based ensemble prediction score, which incorporates 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. Larger value means the SNV is more likely to be damaging. Scores range from -2 to 3 in dbNSFP.
LR score LR scores evaluate the deleteriousness of missense mutations by using logistic regression to integrate the same component scores and maximum 1000 Genomes frequency used by the RadialSVM score. Larger values indicate that the SNV is more likely to be damaging.
VEST3 score The Variant Effect Scoring Tool (VEST) is a supervised machine-learning classifier for prioritising rare missense variants with likely involvement in human disease. The VEST training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and ~45,000 high-frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement.
CADD raw Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection (differences between 1000 Genomes and the Ensembl Compara inferred human-chimpanzee ancestral genome) with simulated mutations. Raw CADD scores come straight from the model and have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values have relative meaning with larger scores indicating that a variant is more likely to have a damaging effect.
CADD phred
Phred-scaled CADD scores are rank-based transformations of the genome-wide raw CADD scores: variants in the top 10% of raw scores are assigned CADD-10, the top 1% CADD-20, the top 0.1% CADD-30, and so on. The results of this transformation are Phred-like scaled CADD scores.
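The rank-to-Phred transformation can be sketched as follows (a minimal illustration of the scaling, not the CADD implementation itself; the `cadd_phred` helper is hypothetical):

```python
import math

def cadd_phred(rank, total):
    """Phred-scale a raw CADD score from its rank (1 = highest raw
    score) among all scored variants: -10 * log10(rank / total).
    Top 10% -> >= 10, top 1% -> >= 20, top 0.1% -> >= 30."""
    return -10 * math.log10(rank / total)

print(cadd_phred(1, 10))    # ~10 (top 10%)
print(cadd_phred(1, 1000))  # ~30 (top 0.1%)
```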
Genomic Evolutionary Rate Profiling Rejected Substitution scores (GERP RS). GERP identifies constrained elements in multiple alignments by quantifying substitution deficits for SNPs. Scores range from -12.3 to 6.17, with 6.17 being the most conserved. Scores greater than 2 provide a high sensitivity while still strongly enriched for truly constrained sites.
PhyloP46way placental Phylogenetic conservation score based on a multiple alignment of 46 vertebrate genomes (10 primates, 33 placental mammals).
PhyloP100way vertebrate Phylogenetic conservation score based on a multiple alignment of 100 vertebrate genomes.
SiPhy 29way logOdds The estimated stationary distribution of A, C, G and T at the site, computed with the SiPhy algorithm from an alignment of 29 mammalian genomes.
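A common way to use the annotations above is to require agreement between several independent lines of evidence before prioritising a variant. The sketch below is illustrative only: the thresholds (SIFT < 0.05, CADD phred >= 20, GERP RS > 2) are widely used conventions rather than definitive cut-offs, and the `looks_damaging` helper is hypothetical.

```python
def looks_damaging(ann):
    """Flag a missense variant as a candidate when at least two
    independent scores agree.  Thresholds are illustrative
    conventions: SIFT < 0.05, CADD phred >= 20 (top 1% genome-wide),
    GERP RS > 2 (constrained site)."""
    votes = 0
    if ann.get("sift") is not None and ann["sift"] < 0.05:
        votes += 1
    if ann.get("cadd_phred") is not None and ann["cadd_phred"] >= 20:
        votes += 1
    if ann.get("gerp_rs") is not None and ann["gerp_rs"] > 2:
        votes += 1
    return votes >= 2

print(looks_damaging({"sift": 0.01, "cadd_phred": 25.3, "gerp_rs": 4.1}))  # True
print(looks_damaging({"sift": 0.60, "cadd_phred": 3.2, "gerp_rs": 0.5}))   # False
```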
Please work through and read: Creating Workflows and Advanced Workflow Options
Reproducing an experiment on the same data or different datasets and comparing the results is a key feature of research that requires keeping a meticulous set of instructions. In many cases, multiple replicates are required and the best approach is an automated analysis that is consistent and less error prone. In Galaxy, the history acts as a detailed list of experimental instructions that can be reproduced and automated by creating a ‘Workflow’ or bioinformatic pipeline.
Before making your workflow delete any unwanted steps from the history so that it only includes the steps needed to make the final result.
See Extract Workflow from a History and follow the instructions there.
Name your workflow : Alignment_Variant_Calling_and_Filtering-v0.1 (or something similar, maybe add your initials)
Now that you have made a workflow/bioinformatic pipeline you can apply the whole process to a new dataset with just a few clicks. Try your workflow on the original fastq files.
In the history pane click the cog icon and select ‘Create New’ to make a new history.
Before running the workflow, we need to put the raw data files into the new history. You can do this by uploading the data or copying it from an existing history. We will copy the input files by clicking the cog icon and selecting 'Copy Datasets'.
Select ‘WES01 NGS Workshop’ as the source history and the new ‘Unnamed history’ as the destination.
Check the boxes for the 3 input files that we need:
1. target bed file: chr22.genes.hg19.bed
2. raw fastq: WES01_chr22m_R1.fastq
3. raw fastq: WES01_chr22m_R2.fastq
Scroll to the bottom of the page and click ‘Copy History Items’.
Click the workflow tab then click your ‘Alignment_Variant_Calling_and_Filtering’ workflow and select run from the drop down menu.
Data Set Selection and Order: running a Galaxy workflow
Note the order of the selected data sets:
1 = *.bed file
2 = *.R1.fastq
3 = *.R2.fastq
When the workflow has finished view the output and check that the result matches your original analysis.
Based on a workshop produced by Will Tapper, University of Southampton (shared 2016)